Forecast visualisation
Forecasts of cases/deaths per week per 100,000. The date of the tab marks the date on which a forecast was made (only the latest forecasts and the previous 4 weeks shown).
2021-07-12
Cases

Deaths

2021-07-05
Cases

Deaths

2021-06-28
Cases

Deaths

2021-06-21
Cases

Deaths

Forecast scores
Scores separated by target and forecast horizon. Only models with submissions in each of the last 4 weeks are shown.
Evaluation metrics
- The first column (n) gives the number of forecasts included in the evaluation. This number may vary across models as some models joined later than others, or models may not have submitted forecasts in certain weeks.
- Relative skill (rel_skill) is a relative measure of forecast performance which takes into account that different teams may not cover the exact same set of forecast targets (i.e., weeks and locations). Loosely speaking, a relative skill of X means that averaged over the targets a given team addressed, its weighted interval score (see below) was X times higher/lower than the the average performance of all models. Smaller values are thus better and a value below one means that the model has above average performance. The relative skill is computed using a ‘pairwise comparison tournament’ where for each pair of models a mean score ratio is computed based on the set of shared targets. The relative skill is the geometric mean of these ratios. Details on the computation can be found in this preprint.
- Coverage (50% Cov. / 95% Cov.) is the proportion of observations that fell within a given prediction interval. Ideally, a forecast model would achieve 50% coverage of 0.50 (i.e., 50% of observations fall within the 50% prediction interval) and 95% coverage of 0.95 (i.e., 95% of observations fall within the 95% prediction interval). Values of coverage greater than these nominal values indicate that the forecasts are underconfident, i.e. prediction intervals tend to be too wide, whereas values of coverage smaller than these nominal values indicate that the forecasts are overconfident, i.e. prediction intervals tend to be too narrow.
- The weighted interval score (wis) is a proper scoring rule (i.e., it cannot be “cheated”) that is suited to scoring forecasts in an interval format. It generalizes the absolute error (i.e. lower values are better) and has three components: dispersion, underprediction and overprediction. Dispersion is a weighted average of the widths of the submitted prediction intervals. Over- and underprediction (overpred/underpred) penalties are added whenever an observation falls outside of a reported central prediction interval, with the strength of the penalty depending on the nominal level of the interval and how far outside of the interval the observation fell. Note that the average WIS can refer to different sets of targets for different models and therefore cannot always be compared across models. Such comparisons should be done based on the relative skill.
- bias is a measure between -1 and 1 that expresses the tendency to underpredict (-1) or overpredict (1). In contrast to the over- and underprediction components of the WIS it is bound between -1 and 1 and cannot go to infinity. It is therefore less susceptible to outliers.
- aem is the mean absolute error of the predictive medians. A high aem means the median forecasts tend to be far away from the true values. Again the average may not refer to the same set of targets for different models, meaning that values cannot always be compared.
Scores over time (Luxembourg)
Visualisation of the weighted interval score over time. In addition, the components of the interval score, sharpness (how narrow are forecasts - smaller is better), and penalties for underprediction and overprediction are shown. Scores are again separated by forecast horizon.
1 weeks ahead horizon

Weighted interval score

Overprediction

Underprediction

Sharpness

2 weeks ahead horizon

Weighted interval score

Overprediction

Underprediction

Sharpness

3 weeks ahead horizon

Weighted interval score

Overprediction

Underprediction

Sharpness

4 weeks ahead horizon

Weighted interval score

Overprediction

Underprediction
